Owner: Daniel Soukup - Created: 2025.11.01
In this notebook, we load the processed data and fit our models. We focus on optimizing variations of XGBoost classifiers across hyperparameters that balance bias and variance, while addressing the class imbalance identified during EDA. We chose to compare variations of this single model type in depth so we can focus on the details.
NOTE: due to randomness in model fitting and tuning, rerunning the notebook may change the outputs (such as the top predictors) and introduce inconsistencies with the current markdown.
Let's load our processed data and create feature/target dataframes for both train and test.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
processed_learn = dataiku.Dataset("processed_learn")
processed_learn_df = processed_learn.get_dataframe()
processed_test = dataiku.Dataset("processed_test")
processed_test_df = processed_test.get_dataframe()
processed_learn_df.shape, processed_test_df.shape
((152807, 44), (78826, 44))
processed_learn_df.head()
| class of worker_Not in universe | class of worker_Private | class of worker_Self-employed-not incorporated | class of worker_infrequent_sklearn | sex_Male | education_Children | education_High school graduate | education_Some college but no degree | education_infrequent_sklearn | marital stat_Married-civilian spouse present | marital stat_Never married | marital stat_Widowed | marital stat_infrequent_sklearn | full or part time employment stat_Full-time schedules | full or part time employment stat_Not in labor force | full or part time employment stat_infrequent_sklearn | detailed household and family stat_Child <18 never marr not in subfamily | detailed household and family stat_Householder | detailed household and family stat_Nonfamily householder | detailed household and family stat_Spouse of householder | detailed household and family stat_infrequent_sklearn | detailed household summary in household_Child under 18 never married | detailed household summary in household_Householder | detailed household summary in household_Other relative of householder | detailed household summary in household_Spouse of householder | detailed household summary in household_infrequent_sklearn | num persons worked for employer_1 | num persons worked for employer_2 | num persons worked for employer_3 | num persons worked for employer_4 | num persons worked for employer_6 | num persons worked for employer_infrequent_sklearn | family members under 18_Not in universe | family members under 18_infrequent_sklearn | tax filer stat_Nonfiler | tax filer stat_Single | tax filer stat_infrequent_sklearn | income | age | wage per hour | capital gains | capital losses | dividends from stocks | weeks worked in year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 73 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 58 | 0 | 0 | 0 | 0 | 52 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 18 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
XGBoost rejects feature names containing special characters such as `<` and `>`, so we replace them here.
processed_learn_df.columns = [col.replace("<", "less").replace(">", "more") for col in processed_learn_df.columns]
processed_test_df.columns = [col.replace("<", "less").replace(">", "more") for col in processed_test_df.columns]
TARGET = 'income'
X_train, y_train = processed_learn_df.drop(columns=TARGET), processed_learn_df[TARGET]
X_test, y_test = processed_test_df.drop(columns=TARGET), processed_test_df[TARGET]
Recall that roughly 8% of the processed samples fall into target class 1 (high income), so a dummy classifier that only predicts 0 would be about 92% accurate.
y_train.mean()
0.08062457871694359
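As a quick sanity check of that baseline, here is a minimal sketch with synthetic labels at an ~8% positive rate (the real `y_train` is loaded above): a constant "most frequent" classifier scores exactly one minus the positive rate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for y_train: ~8% positives
rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.08).astype(int)
X = np.zeros((len(y), 1))  # features are irrelevant to a constant strategy

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = dummy.score(X, y)
print(round(acc, 3))  # equals 1 - y.mean(), roughly 0.92
```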
Important note: we won't use the test set for any optimization, to avoid overfitting to it. We reserve it for a single final evaluation of the optimized model, giving an unbiased estimate of performance on completely unseen data.
Our current approach focuses on optimizing XGBoost binary classifiers, using Optuna to search the hyperparameter space efficiently. We also address the class imbalance during training by weighting the minority class, evaluating with the `aucpr` metric, and stratifying the cross-validation folds.
We will record our models and detailed metrics under two main experiments, each capturing multiple runs.
project = dataiku.api_client().get_default_project()
managed_folder = project.get_managed_folder('lV6oqreY')
TUNING_XP = "xgboost_hp_tuning"
BASELINE_XP = "baseline_xp"
As mentioned, we'll be using sample weights to adjust for the class imbalance:
from sklearn.utils.class_weight import compute_sample_weight
from typing import Optional

def get_sample_weights(multiplier: Optional[int]) -> Optional[np.ndarray]:
    """
    Weight the minority class higher so it contributes more to the training loss.
    Returns None (uniform weights) when no multiplier is given.
    """
    if multiplier:
        return compute_sample_weight({0: 1, 1: multiplier}, y_train)
    return None
weights = get_sample_weights(10)
weights
array([1., 1., 1., ..., 1., 1., 1.])
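The edges of that array are majority-class samples, hence the 1s. On a tiny made-up label vector the weighting is easier to see: minority samples receive the multiplier, majority samples keep weight 1.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_small = np.array([0, 0, 1, 0, 1])
w = compute_sample_weight({0: 1, 1: 10}, y_small)
print(w)  # [ 1.  1. 10.  1. 10.]
```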
To compare model variations, we need validation data held out from the train set. For this, we set up a cross-validation helper and define base parameters for our model:
from typing import Dict, Any
import xgboost as xgb
def cross_val_score_xgb(param: Dict[str, Any]) -> pd.DataFrame:
    """
    Fit the model with 3-fold cross-validation using the provided params.
    Return the full CV results, including the per-round out-of-fold metric
    specified in the params.
    """
    dtrain = xgb.DMatrix(
        X_train,
        label=y_train,
        weight=get_sample_weights(param.get("multiplier")),
    )
    results = xgb.cv(
        params=param,
        dtrain=dtrain,
        num_boost_round=param.get("n_estimators"),  # default 10
        nfold=3,
        seed=42,
        verbose_eval=False,
        stratified=param.get("stratified_cv"),  # default False
    )
    return results
# we won't change these
BASE_PARAMS = {
    "verbosity": 0,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",    # adjusted for the imbalance
    "stratified_cv": True,     # adjusted for the imbalance
}
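To illustrate why we stratify under imbalance, here is a minimal sketch with synthetic labels (not our data): stratified folds preserve the positive rate exactly, while plain random folds can drift.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 100 synthetic samples with an 8% positive rate
y = np.array([1] * 8 + [0] * 92)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(rates)  # every fold holds exactly 0.08 positives
```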
param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "multiplier": 1,
})
Let's test our function, logging the run:
def run_cv_with_logging(param: dict) -> pd.DataFrame:
    """
    Log the CV run with MLflow to BASELINE_XP:base_run.
    Models are not saved here; only params and metrics are logged.
    """
    with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
        mlflow_handle.set_experiment(BASELINE_XP)
        with mlflow_handle.start_run(run_name="base_run", nested=True):
            result = cross_val_score_xgb(param)
            best_score = result["test-aucpr-mean"].values[-1]
            # logging
            mlflow_handle.log_params(param)
            mlflow_handle.log_metrics({"best_score": best_score})
    return result
results = run_cv_with_logging(param)
results.tail(3)
2025/11/04 02:32:47 INFO mlflow.tracking._tracking_service.client: 🏃 View run base_run at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/baseline_xp/runs/base_run_1kQ.
2025/11/04 02:32:47 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/baseline_xp.
| train-aucpr-mean | train-aucpr-std | test-aucpr-mean | test-aucpr-std | |
|---|---|---|---|---|
| 7 | 0.520371 | 0.007912 | 0.518399 | 0.005931 |
| 8 | 0.529616 | 0.007390 | 0.526744 | 0.002086 |
| 9 | 0.531546 | 0.007320 | 0.528351 | 0.001600 |
Let's try a large multiplier:
param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "multiplier": 10,
})
results = run_cv_with_logging(param)
results.tail(3)
| train-aucpr-mean | train-aucpr-std | test-aucpr-mean | test-aucpr-std | |
|---|---|---|---|---|
| 7 | 0.868235 | 0.000756 | 0.866994 | 0.001782 |
| 8 | 0.870861 | 0.000259 | 0.869497 | 0.002276 |
| 9 | 0.877161 | 0.001560 | 0.876323 | 0.002977 |
We can see that the multiplier has a significant effect on the aucpr score.
We can also test XGBoost's recommended scale_pos_weight parameter for balancing classes. The typical value suggested by the XGBoost documentation is sum(negative instances) / sum(positive instances). It assigns a single weight to the entire positive class, independent of the sample, so we would expect similar results.
scale_pos_weight = (1 - y_train).sum()/y_train.sum()
scale_pos_weight
11.403165584415584
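To make the formula concrete, here is the same neg/pos ratio computed on a made-up label vector (92 negatives, 8 positives, close to our actual class balance):

```python
import numpy as np

# 92 negatives, 8 positives -> ratio 11.5, close to our ~11.4
y_demo = np.array([0] * 92 + [1] * 8)
ratio = (1 - y_demo).sum() / y_demo.sum()
print(ratio)  # 11.5
```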
param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "scale_pos_weight": scale_pos_weight,
})
results = run_cv_with_logging(param)
results.tail(3)
| train-aucpr-mean | train-aucpr-std | test-aucpr-mean | test-aucpr-std | |
|---|---|---|---|---|
| 7 | 0.467183 | 0.000913 | 0.464735 | 0.007800 |
| 8 | 0.472560 | 0.005156 | 0.471199 | 0.014328 |
| 9 | 0.508974 | 0.000292 | 0.507859 | 0.009758 |
Interestingly, scale_pos_weight alone doesn't lift the score as much as the per-sample multiplier did, so we'll keep it in the base parameters and explore it further in the future.
BASE_PARAMS.update({"scale_pos_weight": scale_pos_weight})
Next, we'll look to optimize the model hyperparameters more systematically.
The function below defines the hyperparameter space to explore (parameters and their ranges), focusing on five parameters with a known strong effect on model performance and regularization:
def objective(trial) -> float:
    """
    Capture a single param combination and model fitting,
    evaluated using cross-validation.
    """
    param = BASE_PARAMS.copy()
    param.update({
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),                # default 1 - all rows
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),  # default 1 - all columns
        "n_estimators": trial.suggest_int("n_estimators", 10, 100, step=10),    # default 100
        "max_depth": trial.suggest_int("max_depth", 2, 20, step=2),             # default 6
        "multiplier": trial.suggest_int("multiplier", 1, 50),
    })
    with mlflow_handle.start_run(run_name="trial", nested=True):
        result = cross_val_score_xgb(param)
        best_score = result["test-aucpr-mean"].values[-1]
        # logging
        mlflow_handle.log_params(param)
        mlflow_handle.log_metrics({"best_score": best_score})
    return best_score
Finally, we are ready to run our study, currently consisting of 40 trials:
from xgboost import XGBClassifier
import optuna
N_TRIALS = 40
with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
    mlflow_handle.set_experiment(TUNING_XP)
    with mlflow_handle.start_run(run_name="study", nested=True) as study_run:
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=N_TRIALS, timeout=600)
        # logging
        best_params = study.best_trial.params.copy()
        mlflow_handle.log_metrics({"best_score": study.best_trial.value})
        # refit best model; 'multiplier' is our custom key, not an XGBoost
        # parameter, so pop it and apply it through sample weights
        multiplier = best_params.pop("multiplier")
        model = XGBClassifier(**best_params)
        model = model.fit(X_train, y_train, sample_weight=get_sample_weights(multiplier))
        # log best params & model
        mlflow_handle.log_params(model.get_xgb_params())
        mlflow_handle.xgboost.log_model(
            model,
            "xgboost_model",
            input_example=X_train.head(10),
            pip_requirements=["xgboost==2.1.1"],
        )
[I 2025-11-04 02:32:51,309] A new study created in memory with name: no-name-d7ae5999-cf6e-40a2-a804-f8eb00d085e1
2025/11/04 02:32:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_0E0.
2025/11/04 02:32:54 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:32:54,722] Trial 0 finished with value: 0.9614144704755057 and parameters: {'subsample': 0.825605290504805, 'colsample_bytree': 0.3832677250492007, 'n_estimators': 30, 'max_depth': 2, 'multiplier': 33}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:33:10 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_iXp.
2025/11/04 02:33:10 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:33:10,111] Trial 1 finished with value: 0.9602585583527449 and parameters: {'subsample': 0.8478540376710844, 'colsample_bytree': 0.7015912214514144, 'n_estimators': 70, 'max_depth': 16, 'multiplier': 40}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:33:22 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_UmW.
2025/11/04 02:33:22 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:33:22,852] Trial 2 finished with value: 0.9352989114189322 and parameters: {'subsample': 0.8537802558229641, 'colsample_bytree': 0.5048092205054556, 'n_estimators': 100, 'max_depth': 8, 'multiplier': 18}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:33:29 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_qPB.
2025/11/04 02:33:29 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:33:29,236] Trial 3 finished with value: 0.8770431863711123 and parameters: {'subsample': 0.6454366590104292, 'colsample_bytree': 0.2032229705153732, 'n_estimators': 40, 'max_depth': 16, 'multiplier': 9}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:33:34 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_qH8.
2025/11/04 02:33:34 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:33:34,799] Trial 4 finished with value: 0.9196273293096352 and parameters: {'subsample': 0.9869951877841878, 'colsample_bytree': 0.9928213221235087, 'n_estimators': 40, 'max_depth': 8, 'multiplier': 13}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:33:37 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_iTQ.
2025/11/04 02:33:37 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:33:37,966] Trial 5 finished with value: 0.7325278176678967 and parameters: {'subsample': 0.5993603638887082, 'colsample_bytree': 0.7067468671287755, 'n_estimators': 10, 'max_depth': 18, 'multiplier': 3}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:33:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_RuG.
2025/11/04 02:33:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:33:53,222] Trial 6 finished with value: 0.9308586156945656 and parameters: {'subsample': 0.43438290126027407, 'colsample_bytree': 0.25738664811471756, 'n_estimators': 100, 'max_depth': 12, 'multiplier': 23}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:34:02 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_P3f.
2025/11/04 02:34:02 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:34:02,204] Trial 7 finished with value: 0.9450246517622292 and parameters: {'subsample': 0.3537743988036588, 'colsample_bytree': 0.2609098767884359, 'n_estimators': 60, 'max_depth': 10, 'multiplier': 30}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:34:06 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_ppy.
2025/11/04 02:34:06 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:34:06,783] Trial 8 finished with value: 0.9468155257252358 and parameters: {'subsample': 0.21188353515886132, 'colsample_bytree': 0.6060459799549448, 'n_estimators': 30, 'max_depth': 8, 'multiplier': 43}. Best is trial 0 with value: 0.9614144704755057.
2025/11/04 02:34:17 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_caw.
2025/11/04 02:34:17 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:34:17,898] Trial 9 finished with value: 0.962510283795452 and parameters: {'subsample': 0.9380622452956018, 'colsample_bytree': 0.31770739094166367, 'n_estimators': 60, 'max_depth': 18, 'multiplier': 40}. Best is trial 9 with value: 0.962510283795452.
2025/11/04 02:34:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_Lbz.
2025/11/04 02:34:35 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:34:35,432] Trial 10 finished with value: 0.9682529764904414 and parameters: {'subsample': 0.9835293443271642, 'colsample_bytree': 0.46027257146701206, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 50}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:34:53 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_CSf.
2025/11/04 02:34:53 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:34:53,727] Trial 11 finished with value: 0.9680302928203907 and parameters: {'subsample': 0.9734208042492666, 'colsample_bytree': 0.44389418737397, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 50}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:35:11 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_95h.
2025/11/04 02:35:11 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:35:11,111] Trial 12 finished with value: 0.9663668908164472 and parameters: {'subsample': 0.7313241089847379, 'colsample_bytree': 0.460061793876653, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 50}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:35:29 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_MEa.
2025/11/04 02:35:29 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:35:29,593] Trial 13 finished with value: 0.9680317912184249 and parameters: {'subsample': 0.9729809259647073, 'colsample_bytree': 0.5436127270434763, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 50}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:35:45 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_iq7.
2025/11/04 02:35:45 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:35:45,504] Trial 14 finished with value: 0.9635243850635035 and parameters: {'subsample': 0.7375920300491778, 'colsample_bytree': 0.6096376017021395, 'n_estimators': 80, 'max_depth': 14, 'multiplier': 45}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:35:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_WYj.
2025/11/04 02:35:54 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:35:54,137] Trial 15 finished with value: 0.9650796526359162 and parameters: {'subsample': 0.5771506567399569, 'colsample_bytree': 0.7165733662193164, 'n_estimators': 90, 'max_depth': 2, 'multiplier': 34}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:36:12 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_c0i.
2025/11/04 02:36:12 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:36:12,937] Trial 16 finished with value: 0.9556150253842213 and parameters: {'subsample': 0.8976153017725302, 'colsample_bytree': 0.9200048339345772, 'n_estimators': 70, 'max_depth': 20, 'multiplier': 37}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:36:31 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_Oqm.
2025/11/04 02:36:31 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:36:31,627] Trial 17 finished with value: 0.940254961784663 and parameters: {'subsample': 0.7453735380295278, 'colsample_bytree': 0.5408102063212752, 'n_estimators': 90, 'max_depth': 16, 'multiplier': 26}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:36:41 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_bZy.
2025/11/04 02:36:41 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:36:41,165] Trial 18 finished with value: 0.9671482774282493 and parameters: {'subsample': 0.9922781002274723, 'colsample_bytree': 0.8109309422352831, 'n_estimators': 50, 'max_depth': 14, 'multiplier': 46}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:36:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_CqR.
2025/11/04 02:36:54 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:36:54,771] Trial 19 finished with value: 0.9632672234857976 and parameters: {'subsample': 0.49123791961248425, 'colsample_bytree': 0.38106032998418177, 'n_estimators': 70, 'max_depth': 18, 'multiplier': 50}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:37:04 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_dm8.
2025/11/04 02:37:04 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:37:04,619] Trial 20 finished with value: 0.9590210889712488 and parameters: {'subsample': 0.8082888335798499, 'colsample_bytree': 0.5779776774494827, 'n_estimators': 90, 'max_depth': 4, 'multiplier': 27}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:37:21 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_hhv.
2025/11/04 02:37:21 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:37:21,393] Trial 21 finished with value: 0.9664873662971282 and parameters: {'subsample': 0.9439489080742313, 'colsample_bytree': 0.44058254892978865, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 47}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:37:38 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_I9I.
2025/11/04 02:37:38 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:37:38,452] Trial 22 finished with value: 0.9682320854999182 and parameters: {'subsample': 0.8996071095160296, 'colsample_bytree': 0.4208010660521917, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 50}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:37:51 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_aID.
2025/11/04 02:37:51 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:37:51,833] Trial 23 finished with value: 0.9621088542632995 and parameters: {'subsample': 0.8904183230563689, 'colsample_bytree': 0.36762078810680027, 'n_estimators': 70, 'max_depth': 18, 'multiplier': 41}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:38:09 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_i2i.
2025/11/04 02:38:09 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:38:10,013] Trial 24 finished with value: 0.9589049499141008 and parameters: {'subsample': 0.9118793033469996, 'colsample_bytree': 0.5282809891660126, 'n_estimators': 100, 'max_depth': 14, 'multiplier': 37}. Best is trial 10 with value: 0.9682529764904414.
2025/11/04 02:38:31 INFO mlflow.tracking._tracking_service.client: 🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_rNv.
2025/11/04 02:38:31 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
[I 2025-11-04 02:38:31,565] Trial 25 finished with value: 0.9633659897459954 and parameters: {'subsample': 0.7964590064989514, 'colsample_bytree': 0.6379478099997608, 'n_estimators': 90, 'max_depth': 20, 'multiplier': 46}. Best is trial 10 with value: 0.9682529764904414.
[I 2025-11-04 02:38:44,853] Trial 26 finished with value: 0.9615117699658301 and parameters: {'subsample': 0.6504151135254745, 'colsample_bytree': 0.49341960681828345, 'n_estimators': 60, 'max_depth': 18, 'multiplier': 43}. Best is trial 10 with value: 0.9682529764904414.
[I 2025-11-04 02:38:54,118] Trial 27 finished with value: 0.9696427400889741 and parameters: {'subsample': 0.8946212270290453, 'colsample_bytree': 0.33619875230945345, 'n_estimators': 50, 'max_depth': 16, 'multiplier': 50}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:39:03,181] Trial 28 finished with value: 0.9589840788924828 and parameters: {'subsample': 0.7696944912161932, 'colsample_bytree': 0.3240285474009349, 'n_estimators': 50, 'max_depth': 16, 'multiplier': 36}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:39:05,707] Trial 29 finished with value: 0.953824434670799 and parameters: {'subsample': 0.6766152610468457, 'colsample_bytree': 0.3966101771967614, 'n_estimators': 10, 'max_depth': 12, 'multiplier': 31}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:39:11,752] Trial 30 finished with value: 0.9371736913051004 and parameters: {'subsample': 0.8583588722244974, 'colsample_bytree': 0.3173350283550115, 'n_estimators': 30, 'max_depth': 18, 'multiplier': 21}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:39:22,543] Trial 31 finished with value: 0.9676551167859354 and parameters: {'subsample': 0.9335988940401728, 'colsample_bytree': 0.4143437666478649, 'n_estimators': 50, 'max_depth': 20, 'multiplier': 48}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:39:36,401] Trial 32 finished with value: 0.9652850186921381 and parameters: {'subsample': 0.9984226044618867, 'colsample_bytree': 0.48256825127963376, 'n_estimators': 70, 'max_depth': 16, 'multiplier': 44}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:39:55,368] Trial 33 finished with value: 0.959741923487229 and parameters: {'subsample': 0.8572029226879303, 'colsample_bytree': 0.5605766461438807, 'n_estimators': 80, 'max_depth': 20, 'multiplier': 40}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:40:05,225] Trial 34 finished with value: 0.9669935032538625 and parameters: {'subsample': 0.8243191294856571, 'colsample_bytree': 0.6608052757325548, 'n_estimators': 40, 'max_depth': 18, 'multiplier': 48}. Best is trial 27 with value: 0.9696427400889741.
[I 2025-11-04 02:40:14,867] Trial 35 finished with value: 0.9707999718025174 and parameters: {'subsample': 0.8699372174220117, 'colsample_bytree': 0.25372555634768734, 'n_estimators': 60, 'max_depth': 16, 'multiplier': 50}. Best is trial 35 with value: 0.9707999718025174.
[I 2025-11-04 02:40:23,645] Trial 36 finished with value: 0.9671804348254622 and parameters: {'subsample': 0.8716328163119681, 'colsample_bytree': 0.20737033413177045, 'n_estimators': 60, 'max_depth': 14, 'multiplier': 42}. Best is trial 35 with value: 0.9707999718025174.
[I 2025-11-04 02:40:30,927] Trial 37 finished with value: 0.8076142569150232 and parameters: {'subsample': 0.6989762913269659, 'colsample_bytree': 0.26153262071408, 'n_estimators': 50, 'max_depth': 10, 'multiplier': 4}. Best is trial 35 with value: 0.9707999718025174.
[I 2025-11-04 02:40:36,842] Trial 38 finished with value: 0.9016284959054728 and parameters: {'subsample': 0.8049974922809472, 'colsample_bytree': 0.34085061697250385, 'n_estimators': 30, 'max_depth': 16, 'multiplier': 12}. Best is trial 35 with value: 0.9707999718025174.
[I 2025-11-04 02:40:40,600] Trial 39 finished with value: 0.9629676689684133 and parameters: {'subsample': 0.9143056142133974, 'colsample_bytree': 0.27994425965866454, 'n_estimators': 20, 'max_depth': 12, 'multiplier': 39}. Best is trial 35 with value: 0.9707999718025174.
/opt/dataiku/code-env/lib/python3.11/site-packages/mlflow/types/utils.py:407: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values <https://www.mlflow.org/docs/latest/models.html#handling-integers-with-missing-values>`_ for more details.
warnings.warn(
/opt/dataiku/code-env/lib64/python3.11/site-packages/sklearn/utils/_tags.py:354: DeprecationWarning: The XGBClassifier or classes from which it inherits use `_get_tags` and `_more_tags`. Please define the `__sklearn_tags__` method, or inherit from `sklearn.base.BaseEstimator` and/or other appropriate mixins such as `sklearn.base.TransformerMixin`, `sklearn.base.ClassifierMixin`, `sklearn.base.RegressorMixin`, and `sklearn.base.OutlierMixin`. From scikit-learn 1.7, not defining `__sklearn_tags__` will raise an error.
warnings.warn(
2025/11/04 02:40:46 WARNING mlflow.models.model: Failed to validate serving input example (full example payload omitted). Alternatively, you can avoid passing input example and pass model signature instead when logging the model. To ensure the input example is valid prior to serving, please try calling `mlflow.models.validate_serving_input` on the model uri and serving input example. A serving input example can be generated from model input example using `mlflow.models.convert_input_example_to_serving_input` function.
Got error: 'super' object has no attribute '__sklearn_tags__'
2025/11/04 02:40:46 INFO mlflow.tracking._tracking_service.client: 🏃 View run study at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/study_9en.
2025/11/04 02:40:46 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning.
Let's see the best results:
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Best trial:
Value: 0.9707999718025174
Params:
subsample: 0.8699372174220117
colsample_bytree: 0.25372555634768734
n_estimators: 60
max_depth: 16
multiplier: 50
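With the study done, the winning trial can be promoted to a final model. Below is a minimal sketch of translating the Optuna parameters into `XGBClassifier` keyword arguments, assuming (as its role in class weighting suggests) that `multiplier` is what gets passed to XGBoost as `scale_pos_weight`; the `best_params` dict simply mirrors the values printed above:

```python
# Hypothetical: mirrors study.best_trial.params from the output above.
best_params = {
    "subsample": 0.8699372174220117,
    "colsample_bytree": 0.25372555634768734,
    "n_estimators": 60,
    "max_depth": 16,
    "multiplier": 50,
}

def to_xgb_kwargs(params):
    """Map the search-space params onto XGBClassifier kwargs.

    Assumes 'multiplier' is the positive-class weight, i.e. it becomes
    scale_pos_weight; adjust if the objective function uses it differently.
    """
    kwargs = dict(params)
    kwargs["scale_pos_weight"] = kwargs.pop("multiplier")
    return kwargs

xgb_kwargs = to_xgb_kwargs(best_params)
# final_model = XGBClassifier(eval_metric="aucpr", **xgb_kwargs).fit(X_train, y_train)
```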
Let's see how the HP choices impacted performance:
study_df = study.trials_dataframe()
study_df.head()
| | number | value | datetime_start | datetime_complete | duration | params_colsample_bytree | params_max_depth | params_multiplier | params_n_estimators | params_subsample | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.961414 | 2025-11-04 02:32:51.310074 | 2025-11-04 02:32:54.722548 | 0 days 00:00:03.412474 | 0.383268 | 2 | 33 | 30 | 0.825605 | COMPLETE |
| 1 | 1 | 0.960259 | 2025-11-04 02:32:54.723047 | 2025-11-04 02:33:10.111392 | 0 days 00:00:15.388345 | 0.701591 | 16 | 40 | 70 | 0.847854 | COMPLETE |
| 2 | 2 | 0.935299 | 2025-11-04 02:33:10.111876 | 2025-11-04 02:33:22.851998 | 0 days 00:00:12.740122 | 0.504809 | 8 | 18 | 100 | 0.853780 | COMPLETE |
| 3 | 3 | 0.877043 | 2025-11-04 02:33:22.852478 | 2025-11-04 02:33:29.236187 | 0 days 00:00:06.383709 | 0.203223 | 16 | 9 | 40 | 0.645437 | COMPLETE |
| 4 | 4 | 0.919627 | 2025-11-04 02:33:29.236681 | 2025-11-04 02:33:34.799212 | 0 days 00:00:05.562531 | 0.992821 | 8 | 13 | 40 | 0.986995 | COMPLETE |
Overall, there is not much variance in the score, except for some combinations where overfitting likely occurs (e.g., max_depth and n_estimators both high).
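To put a number on that spread, the trial scores can be summarized directly; a sketch with illustrative values (in the notebook, run the same aggregation on `study_df["value"]`):

```python
import pandas as pd

# Illustrative trial scores; in the notebook use study_df["value"] instead.
scores = pd.Series([0.961, 0.960, 0.935, 0.877, 0.920, 0.968, 0.971])

# min/median/max/std in one pass
spread = scores.agg(["min", "median", "max", "std"])
print(spread)
```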
We will look at different projections of the HP space:
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations"
)
fig.show()
The patterns are not fully clear here. We expect the best-performing models in the mid-to-upper range of boosting rounds with a lower max depth (the latter helps avoid overfitting when the number of estimators is high).
pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations",
)
fig.update_yaxes(
    scaleanchor="x",  # anchor y to x so the heatmap cells stay square
)
fig.show()
In our experiments, the high scores often corresponded with smaller column samples (the fraction of columns each estimator uses) unless max depth was significantly lowered. The small column sample again helps avoid overfitting, although the patterns are less clear here.
pivot = pd.pivot_table(study_df, index="params_n_estimators", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations",
)
fig.update_yaxes(
    scaleanchor="x",  # anchor y to x so the heatmap cells stay square
)
fig.show()
While the patterns are not the clearest here, we expect high boosting rounds combined with a high column sample to lead to lower scores (the bottom-right corner; likely overfitting again).
Given that some of the best results were observed at the end of the specified search range, it would be a good next step to extend the range further, potentially with a larger step size for boosting rounds.
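A widened space could be sketched as below, using the same Optuna `suggest_*` API the study relies on; the ranges are hypothetical extensions, not the notebook's originals:

```python
def suggest_params(trial):
    """Hypothetical widened search space for a follow-up Optuna study.

    `trial` is an optuna.trial.Trial; each suggest_* call registers one
    hyperparameter with its sampling range.
    """
    return {
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        # larger ceiling and coarser step for boosting rounds
        "n_estimators": trial.suggest_int("n_estimators", 20, 300, step=20),
        "max_depth": trial.suggest_int("max_depth", 2, 20, step=2),
        "multiplier": trial.suggest_int("multiplier", 20, 80),
    }
```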
Finally, we look at the multiplier effect:
pd.pivot_table(study_df, index="params_multiplier", values="value", aggfunc='mean').T
| params_multiplier | 3 | 4 | 9 | 12 | 13 | 18 | 21 | 23 | 26 | 27 | 30 | 31 | 33 | 34 | 36 | 37 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | 0.732528 | 0.807614 | 0.877043 | 0.901628 | 0.919627 | 0.935299 | 0.937174 | 0.930859 | 0.940255 | 0.959021 | 0.945025 | 0.953824 | 0.961414 | 0.96508 | 0.958984 | 0.95726 | 0.962968 | 0.960837 | 0.962109 | 0.96718 | 0.954164 | 0.965285 | 0.963524 | 0.965257 | 0.966487 | 0.967324 | 0.967828 |
On average, the higher the multiplier, the better the AUCPR score we got, which is also shown in the heatmaps below. It looks like we get the most benefit from weights above ~40.
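For context, the textbook starting point for XGBoost's positive-class weight is the negative-to-positive label ratio; a toy sketch of that heuristic (assuming `multiplier` plays the `scale_pos_weight` role):

```python
import numpy as np

# Toy imbalanced labels; in the notebook this would be y_train (the income target).
y = np.array([0] * 94 + [1] * 6)

# Classic heuristic for scale_pos_weight: number of negatives per positive.
ratio = (y == 0).sum() / (y == 1).sum()
print(ratio)
```

Values well above this baseline (as the study preferred) trade precision for recall on the rare class.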
pivot = pd.pivot_table(study_df, index="params_multiplier", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations"
)
fig.show()
This pattern is nicely visible in the heatmaps above and below as well.
pivot = pd.pivot_table(study_df, index="params_multiplier", columns="params_max_depth", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations"
)
fig.show()
We compute and store both the predicted class and the predicted probabilities:
predictions_learn_df = pd.DataFrame(
{
TARGET: y_train,
'pred': model.predict(X_train),
'pred_proba': model.predict_proba(X_train)[:, 1]
}
)
predictions_test_df = pd.DataFrame(
{
TARGET: y_test,
'pred': model.predict(X_test),
'pred_proba': model.predict_proba(X_test)[:, 1]
}
)
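Because the probabilities are kept, the 0.5 cutoff implicit in `predict` is not the only option; on imbalanced data a tuned threshold often trades precision and recall better. A sketch with toy values (in the notebook, `pred_proba` and the true labels would take their place):

```python
import numpy as np

# Toy scores and labels; substitute predictions_test_df["pred_proba"] and the target.
proba = np.array([0.08, 0.42, 0.33, 0.81, 0.67, 0.12])
y_true = np.array([0, 0, 1, 1, 1, 0])

def f1_at(threshold):
    """F1 score when predicting positive for proba >= threshold."""
    pred = (proba >= threshold).astype(int)
    tp = ((pred == 1) & (y_true == 1)).sum()
    fp = ((pred == 1) & (y_true == 0)).sum()
    fn = ((pred == 0) & (y_true == 1)).sum()
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Scan a threshold grid and keep the best-F1 cutoff.
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=f1_at)
```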
Finally, let's look at the feature importances for our model (top 20):
fig = pd.DataFrame(
{
'importance': model.feature_importances_,
},
index=model.feature_names_in_
).sort_values('importance').tail(20).plot(kind="bar", backend='plotly')
fig.show()
Note: rerunning the notebook can significantly change the results. We highlight a few commonalities below.
Observations:
All these findings align with our expectations and EDA. Our model picked up on the gender bias in our data (there are far more high-earning males than females in the dataset), which can definitely be addressed in future model iterations - please see the slides for more info.
# gender imbalance
processed_learn_df.groupby(TARGET)["sex_Male"].mean()
income
0    0.458790
1    0.785714
Name: sex_Male, dtype: float64
79% of high-income earners were male, compared with 46% of low-income earners. This statistical disparity is a strong signal for the model to pick up on and use for classification.
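The same group-rate comparison applied to the model's predictions would show whether the classifier amplifies this base-rate gap; a small sketch with toy data mirroring the `sex_Male`/`income` columns:

```python
import pandas as pd

# Toy frame; in the notebook, replace "income" with the model's predictions
# to compare predicted positive rates across groups.
df = pd.DataFrame({
    "sex_Male": [1, 1, 1, 0, 0, 0, 1, 0],
    "income":   [1, 1, 0, 0, 0, 1, 0, 0],
})

# Positive rate per group, and the demographic-parity-style gap between them.
rates = df.groupby("sex_Male")["income"].mean()
gap = rates[1] - rates[0]
```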
Finally, we save the results to their own datasets, which can be used for evaluation:
# Write recipe outputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn.write_with_schema(predictions_learn_df)
predictions_test = dataiku.Dataset("predictions_test")
predictions_test.write_with_schema(predictions_test_df)
/opt/dataiku/python/dataiku/core/schema_handling.py:68: DeprecationWarning: is_datetime64tz_dtype is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.DatetimeTZDtype)` instead.
152807 rows successfully written (mHvP8zt8CY)
/opt/dataiku/python/dataiku/core/schema_handling.py:68: DeprecationWarning: is_datetime64tz_dtype is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.DatetimeTZDtype)` instead.
78826 rows successfully written (kWY6AUhgHx)
study_data = dataiku.Dataset("xgboost_study")
study_data.write_with_schema(study_df)
/opt/dataiku/python/dataiku/core/schema_handling.py:68: DeprecationWarning: is_datetime64tz_dtype is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.DatetimeTZDtype)` instead.
40 rows successfully written (oP7EpQaxQa)